-
We have implemented GPU-aware support across all AWP-ODC versions and enhanced message-passing collective communications for this memory-bound finite-difference solver. This provides cutting-edge communication support for production simulations on leadership-class computing facilities, including OLCF Frontier and TACC Vista. We achieved significant performance gains, reaching 37 sustained Petaflop/s and reducing time-to-solution by 17.2% using the GPU-aware feature on 8,192 Frontier nodes, or 65,536 MI250X GCDs. The AWP-ODC code has also been optimized for TACC Vista, an Arm-based NVIDIA GH200 Grace Hopper Superchip system, demonstrating excellent application performance. This poster will showcase these studies and GPU performance characteristics, and will discuss our verification of the GPU-aware development and the use of high-performance MVAPICH libraries, including on-the-fly compression, on modern GPU clusters.
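To make the communication pattern concrete, here is a minimal sketch of a GPU-aware halo exchange in Python with mpi4py and CuPy. It is not the AWP-ODC implementation (a compiled CUDA/HIP code); the buffer size and neighbor layout are illustrative, and it assumes an underlying GPU-aware MPI build (e.g. MVAPICH with GPU support) so device buffers can be handed to MPI directly instead of being staged through host memory.

```python
# Minimal sketch (assumptions noted above): exchange boundary data between
# neighboring ranks by passing GPU-resident arrays directly to MPI calls.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx = 1024                                               # illustrative halo width
halo_send = cp.full(nx, float(rank), dtype=cp.float32)  # boundary plane on the GPU
halo_recv = cp.empty(nx, dtype=cp.float32)

left, right = (rank - 1) % size, (rank + 1) % size

# With a GPU-aware MPI library, the device arrays are forwarded via their
# GPU array interface, avoiding an explicit device-to-host staging copy.
comm.Sendrecv(sendbuf=halo_send, dest=right,
              recvbuf=halo_recv, source=left)
cp.cuda.Device().synchronize()
```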
-
We present Radius, a gradient sparsity algorithm and system that accelerates large foundation model (FM) training while preserving downstream task performance. Radius leverages two key insights in large FM pre-training: 1) only a small portion of gradients contribute to the model updates in each iteration, and 2) the spatial distribution of large-magnitude gradients is stable over time. Radius overcomes the scaling problem of existing top-k sparsity methods because it maintains the structure of the sparse gradients and thus avoids dense communication. We examine the convergence and speed of Radius when pre-training GPT models (355M and 2.0B parameters) in a data-parallel setting and compare it with baseline top-k sparsification methods. Our results show that the existing top-k method with the AdamW optimizer fails to converge, and that its training speed improvement from sparse communication is marginal. In contrast, Radius with 40% sparsity reduces per-step training time by 21% (19% for overall training time) across 64 NVIDIA A100 GPUs connected by the Slingshot 11 interconnect while preserving downstream task performance.
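The following sketch illustrates the kind of structured gradient sparsity described above: a row-level mask chosen by gradient magnitude and held fixed across steps, so only the selected rows are all-reduced. The row granularity, refresh policy, and helper names are assumptions for illustration, not the paper's implementation.

```python
# Sketch of holding a structured (row-level) gradient mask fixed across steps
# and all-reducing only the selected rows; illustrative, not the Radius code.
import torch
import torch.distributed as dist

def refresh_mask(grad: torch.Tensor, density: float) -> torch.Tensor:
    """Pick the top `density` fraction of rows by gradient L2 norm."""
    k = max(1, int(density * grad.shape[0]))
    top_rows = torch.topk(grad.norm(dim=1), k).indices
    mask = torch.zeros(grad.shape[0], dtype=torch.bool, device=grad.device)
    mask[top_rows] = True
    return mask

def sparse_allreduce(grad: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average only the masked rows across ranks (process group assumed initialized).

    Because every rank uses the same row mask, the communicated tensor is dense
    over the selected rows and no index metadata needs to be exchanged.
    """
    selected = grad[mask].contiguous()
    dist.all_reduce(selected, op=dist.ReduceOp.SUM)
    selected /= dist.get_world_size()
    grad[mask] = selected
    grad[~mask] = 0.0
    return grad

# Typical loop: recompute the mask only occasionally and reuse it in between,
# so its layout (and the communication pattern) stays stable over time.
```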
-
We propose SLOPE, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces model accuracy; to overcome this, prior work uses dense models during fine-tuning. SLOPE improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% of pretraining iterations, without adding significant overhead to pretraining or inference. In addition, SLOPE uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLOPE accelerates the training and inference of models with billions of parameters by up to 1.25× and 1.54×, respectively (OPT-33B and OPT-66B), while reducing their memory usage by up to 0.63× and 0.61× for training and inference, respectively.
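Below is a small sketch of N:M magnitude pruning (here 2:4), the sparsity structure applied to the weight matrix and, in the double-pruned backward pass, to the transposed weights as well. This is an illustrative re-implementation using dense masking rather than true sparse kernels; the function name is an assumption.

```python
# Sketch of N:M (here 2:4) magnitude pruning; illustrative, dense masking only.
import torch

def prune_n_m(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude entries in every group of m along the last dim."""
    rows, cols = weight.shape
    assert cols % m == 0, "columns must be divisible by the group size m"
    groups = weight.abs().reshape(rows, cols // m, m)
    keep = torch.topk(groups, n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return weight * mask.reshape(rows, cols)

w = torch.randn(8, 16)
w_fwd = prune_n_m(w)                        # 2:4-sparse weight for the forward pass
w_bwd = prune_n_m(w_fwd.t().contiguous())   # transposed weight pruned again ("double-pruned")
```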
-
Despite the widespread exploration and availability of parcellations for the functional connectome, parcellations designed for the structural connectome are comparatively limited. Current research suggests that there may be no single “correct” parcellation and that the human brain is intrinsically a multiresolution entity. In this work, we propose the Continuous Structural Connectivity-based, Nested (CoCoNest) family of parcellations: a fully data-driven, multiresolution family of parcellations derived from structural connectome data. The CoCoNest family is created using agglomerative (bottom-up) clustering and error-complexity pruning, which strikes a balance between the complexity of each parcellation and how well it preserves patterns in vertex-level, high-resolution connectivity data. We draw on a comprehensive battery of internal and external evaluation metrics to show that the CoCoNest family is competitive with or outperforms widely used parcellations in the literature. Additionally, we show how the CoCoNest family can serve as an exploratory tool for researchers to investigate the multiresolution organization of the structural connectome.
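As a rough illustration of the bottom-up construction, the sketch below clusters synthetic vertex-level connectivity profiles with agglomerative (Ward) clustering. The data, the fixed number of parcels, and the linkage choice are stand-ins; the actual CoCoNest pipeline additionally applies error-complexity pruning to select nested resolutions from the merge tree.

```python
# Sketch: agglomerative clustering of synthetic connectivity profiles.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
n_vertices, n_targets = 500, 64
# Rows: cortical vertices; columns: structural connectivity to target regions.
profiles = rng.random((n_vertices, n_targets))

# A single bottom-up merge tree underlies all resolutions; here we simply cut
# it at 100 parcels instead of applying error-complexity pruning.
model = AgglomerativeClustering(n_clusters=100, linkage="ward")
labels = model.fit_predict(profiles)
print("first few parcel sizes:", np.bincount(labels)[:10])
```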
-
Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
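The sketch below gives a highly simplified view of the idea: maintain Adam-style moments only on a random low-rank projection of the gradient and transfer the resulting per-channel scaling back to the full gradient. The class name, hyperparameters, channel granularity, and update form are illustrative assumptions, not the released APOLLO implementation.

```python
# Simplified sketch of low-rank, projection-based learning-rate scaling;
# illustrative assumptions throughout, not the released APOLLO optimizer.
import torch

class LowRankScaledGrad:
    def __init__(self, n_cols: int, rank: int = 8,
                 beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
        self.P = torch.randn(n_cols, rank) / rank ** 0.5   # fixed random projection
        self.m = self.v = None                             # moments live in rank-r space
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def scaled_grad(self, grad: torch.Tensor) -> torch.Tensor:
        r = grad @ self.P                                   # project: (rows, rank)
        if self.m is None:
            self.m, self.v = torch.zeros_like(r), torch.zeros_like(r)
        self.m = self.beta1 * self.m + (1 - self.beta1) * r
        self.v = self.beta2 * self.v + (1 - self.beta2) * r * r
        update = self.m / (self.v.sqrt() + self.eps)
        # Per-row factor: how strongly the adaptive rule rescales this channel.
        scale = update.norm(dim=1) / (r.norm(dim=1) + self.eps)
        return grad * scale.unsqueeze(1)                    # structured LR update

# Usage: pass each weight matrix's gradient through scaled_grad() and apply a
# plain SGD-style step with the scaled gradient.
state = LowRankScaledGrad(n_cols=1024)
g = torch.randn(2048, 1024)
g_scaled = state.scaled_grad(g)
```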